Multi-Query Optimization in MapReduce Framework
Abstract
MapReduce has recently emerged as a new paradigm for large-scale data analysis due to its high scalability, fine-grained fault tolerance, and easy programming model. Since different jobs often share similar work (e.g., several jobs scan the same input file or produce the same map output), there are many opportunities to optimize the performance of a batch of jobs. In this paper, we propose two new techniques for multi-job optimization in the MapReduce framework. The first is a generalized grouping technique (which generalizes the recently proposed MRShare technique) that merges multiple jobs into a single job, thereby enabling the merged jobs to share both the scan of the input file and the communication of the common map output. The second is a materialization technique that enables multiple jobs to share both the scan of the input file and the communication of the common map output via partial materialization of the map output of some jobs (in the map and/or reduce phase). Our second contribution is a new optimization algorithm that, given an input batch of jobs, produces an optimal plan by a judicious partitioning of the jobs into groups and an optimal assignment of the processing technique to each group. Our experimental results on Hadoop demonstrate that our new approach significantly outperforms the state-of-the-art technique, MRShare, by up to 107%.
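To illustrate the shared-scan idea behind grouping, the following is a minimal in-memory sketch, not the paper's implementation and not Hadoop code. It assumes each job is a hypothetical (map_fn, reduce_fn) pair and tags every map output with its job name so that a single pass over the input serves all jobs. The names `run_merged`, `word_count`, and `longest_line` are invented for this example, and the sketch does not model sharing of common map output or materialization.

```python
from collections import defaultdict


def run_merged(records, jobs):
    """Run several jobs over one scan of `records`.

    `jobs` maps a job name to a (map_fn, reduce_fn) pair. This layout is
    an illustrative assumption, not a Hadoop or MRShare API.
    Returns (results per job, number of records read).
    """
    records_read = 0
    # Shuffle buffer keyed by (job name, map output key).
    groups = defaultdict(list)
    for record in records:  # single pass shared by every job in the group
        records_read += 1
        for name, (map_fn, _reduce_fn) in jobs.items():
            for key, value in map_fn(record):
                groups[(name, key)].append(value)

    # Reduce each (job, key) group with that job's own reduce function.
    results = {name: {} for name in jobs}
    for (name, key), values in groups.items():
        reduce_fn = jobs[name][1]
        results[name][key] = reduce_fn(values)
    return results, records_read


def word_count_map(line):
    # Emit (word, 1) for every word in the line.
    return [(word, 1) for word in line.split()]


def longest_line_map(line):
    # Emit the number of words in the line under one fixed key.
    return [("words", len(line.split()))]


jobs = {
    "word_count": (word_count_map, sum),
    "longest_line": (longest_line_map, max),
}
records = ["a b a", "b c"]
results, records_read = run_merged(records, jobs)
print(results)
print(records_read)
```

Run separately, the two toy jobs would each read both records. Merged, the input is read once, which is the saving that grouping targets. Whether merging pays off in practice depends on the cost of the larger combined map output, which is the trade-off the optimization algorithm in the abstract is described as addressing.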
Similar resources
Simultaneous Processing of Multi-Skyline Queries with MapReduce
With the rapid increase in the number of applications and in data sizes, multi-query processing on the MapReduce framework has gained much attention. Meanwhile, there has been much interest in skyline query processing due to its power in multi-criteria decision making and analysis. Recently, there have been attempts to optimize multi-query processing in MapReduce. However, they are not ...
A Lightweight Evaluation Framework for Table Layouts in MapReduce Based Query Systems
Table layout determines how relational row-column data values are organized and stored. In recent years, many candidate layouts have been developed for MapReduce-based query systems; they differ in storage space utilization, data loading time, query performance, and so on. Most of the time, users face the problem of choosing the best overall table layout given the ...
Comparative Study of Multi-query Optimization Techniques using Shared Predicate-based for Big Data
Big data analytical systems, such as MapReduce, have become a major concern for many enterprises and research groups. Currently, multiple queries translated into MapReduce jobs are submitted repeatedly with similar tasks, so exploiting these similar tasks makes it possible to avoid repeated computation across MapReduce jobs. Many studies have therefore addressed the sharing opportunity to o...
MRShare: Sharing Across Multiple Queries in MapReduce
Large-scale data analysis lies in the core of modern enterprises and scientific research. With the emergence of cloud computing, the use of an analytical query processing infrastructure (e.g., Amazon EC2) can be directly mapped to monetary value. MapReduce has been a popular framework in the context of cloud computing, designed to serve long running queries (jobs) which can be processed in batc...
Multidimensional Aggregation Process in Cloud Computing System
This paper presents a multidimensional aggregation query processing algorithm for cloud computing systems. Existing cloud computing research on the MapReduce computing framework lacks effective support for aggregating multidimensional data. Moreover, using the MapReduce framework requires starting many computing nodes and consumes large amounts of energy. For the ab...
Journal: PVLDB
Volume: 7
Publication date: 2013